src/AllCells/[RC17]DoubletFinder.Rmd · MonolayerCultures

---
title: "[RC17] Marking Doublets"
author: "Nina-Lydia Kazakou"
date: "04/03/2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Set-up
### Load libraries
```{r message=FALSE, warning=FALSE}
library(SingleCellExperiment)
library(Seurat)
library(tidyverse)
library(Matrix)
library(scales)
library(cowplot)
library(RCurl)
library(scDblFinder)
library(Matrix)
library(ggplot2)
library(edgeR)
library(dplyr)
library(ggsci)  
library(here)
library(scater)
library(scran)
```

### Colour Palette 
```{r load_palette}
mypal <- pal_npg("nrc", alpha = 0.7)(10)
mypal2 <-pal_tron("legacy", alpha = 0.7)(7)
mypal3<-pal_lancet("lanonc", alpha = 0.7)(9)
mypal4<-pal_simpsons(palette = c("springfield"), alpha = 0.7)(16)
mypal5 <- pal_rickandmorty(palette = c("schwifty"), alpha = 0.7)(6)
mypal6 <- pal_futurama(palette = c("planetexpress"), alpha = 0.7)(5)
mypal7 <- pal_startrek(palette = c("uniform"), alpha = 0.7)(5)
mycoloursP<- c(mypal, mypal2, mypal3, mypal4)
```

### Load object
```{r}
norm_sce <- readRDS(here("data", "Processed", "AllCells", "RC17_norm_sce.rds"))
```


# Identify Doublets
scDFinder simuates doublets by randomly mixing the transcription profile of cells in a dataset and then compares real samples to those simulated ones.
It allocates doublet scores to individual cells and not clusters, therefore the scores remain unchanged even if the clustering is changed.

Model variance and select top 10% most variable genes:
```{r}
set.seed(1000)
dec_sce <- modelGeneVarByPoisson(norm_sce)
top_genes <- getTopHVGs(dec_sce, prop=0.1)

set.seed(10000)
norm_sce <- denoisePCA(norm_sce, 
                           subset.row=top_genes, 
                           technical=dec_sce)

set.seed(100000)
norm_sce <- runTSNE(norm_sce, 
                        dimred="PCA",
                        pca = 25)
```


Do a quick clustering:
```{r}
g <- buildSNNGraph(norm_sce, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
colLabels(norm_sce) <- factor(clust)

table(colLabels(norm_sce))

plotTSNE(norm_sce, colour_by="label")
```

Ideally, scDblFinder should be performed after empty droplets are removed and before further filtering is performed. But on the other hand the authors say that it works with normalised data and with dimensional reduced data which are steps that are usually performed after QC. 

It does not make sense that doublets form between samples that are run on different chip. Therefore we specify a sample identifier and the algorithm first runs within sample. 
```{r}
set.seed(100)

norm_sce <- scDblFinder(norm_sce, samples="Sample")
table(norm_sce$scDblFinder.class)
```
This number is already changed compared to the original analysis; This could be quite common, as the scDblFinder calculates the simulated doublets every time, meaning that the result will change every time that I run the algorithm. 
Original run results: #siglet: 18066, #doublet: 1396
However, it still possible, that due to changes in the version of R and updates of the packages, I will not manage to get the exact same results with my original analysis. 

### Create an srt object:
```{r}
RC17_norm <- as.Seurat(norm_sce)
head(RC17_norm@assays$RNA@counts) # raw data (integers)
head(RC17_norm@assays$RNA@data) # log normalized counts (decimal)
```

```{r}
write.csv(as.data.frame(colData(norm_sce)), here("outs", "Processed", "AllCells", "RC17_doubletscore.csv"))
```


```{r}
saveRDS(RC17_norm, here("data", "Processed", "AllCells", "RC17_norm_srt.rds"))
```